What is ggplot2?

ggplot2 is a data visualization package written by Hadley Wickham that uses the “grammar of graphics.” The grammar of graphics provides a consistent way to describe the components of a graph, allowing us to move beyond specific types of plots (e.g., boxplot, scatterplot) to the different elements that compose a plot. As the name implies, the grammar of graphics is a language we can use to describe and build visualizations.

Today, we’ll look at the basic syntax of ggplot2 graphics, as well as some other tidyverse tools, using simulated regression data.

library(ggplot2)
library(dplyr)

If you are so inclined, all of the code for this document is on my GitHub page.

The data

First, we’ll define a function to generate data.

generate_data <- function(n, b0, b1, b2, bint, seed) {
  set.seed(seed)
  x1 <- rnorm(n = n, mean = 0, sd = 1)
  x2 <- sample(factor(c("Male", "Female")), size = n, replace = TRUE,
    prob = c(0.4, 0.6))
  x3 <- sample(factor(c("Caucasian", "Hispanic", "African American")), size = n,
    replace = TRUE, prob = c(0.5, 0.2, 0.3))
  e <- rnorm(n = n, mean = 0, sd = sqrt(10))
  
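  # as.numeric() on a factor returns its level codes; levels sort
  # alphabetically, so Female = 1 and Male = 2
  y <- b0 + (b1 * x1) + (b2 * as.numeric(x2)) + (bint * x1 * as.numeric(x2)) + e
  # data_frame() is deprecated; tibble() is its replacement
  tibble::tibble(outcome = y, predictor = x1, gender = x2, race = x3)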
}

And then, we will use that function to generate a sample for our example.

mlm_data <- generate_data(n = 1000, b0 = 3, b1 = 5, b2 = 3, bint = 4,
  seed = 9416)
mlm_data
#> # A tibble: 1,000 × 4
#>       outcome  predictor gender             race
#>         <dbl>      <dbl> <fctr>           <fctr>
#> 1  -1.1238447 -0.7430094   Male        Caucasian
#> 2  10.8342589  0.2046086 Female         Hispanic
#> 3  -1.7894618 -0.7236642   Male        Caucasian
#> 4   3.6140835  0.3188742 Female         Hispanic
#> 5   6.2861291  0.1414323 Female         Hispanic
#> 6  -8.8799150 -1.2760527   Male        Caucasian
#> 7   0.3724540 -0.8999638   Male        Caucasian
#> 8  33.7692404  1.5350762   Male        Caucasian
#> 9  24.5912195  1.4025014   Male        Caucasian
#> 10 -0.4575962 -0.6454665   Male African American
#> # ... with 990 more rows
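
Because gender enters the model through as.numeric(), it can help to work out what the chosen coefficients imply for each group. The following is a small illustrative sketch, not part of the original analysis. It assumes R's alphabetical ordering of factor levels (Female = 1, Male = 2), and the name implied is made up for this example.

```r
# Coefficients passed to generate_data() above
b0 <- 3; b1 <- 5; b2 <- 3; bint <- 4

# factor() sorts levels alphabetically, so as.numeric() gives Female = 1, Male = 2
gender <- factor(c("Male", "Female"))
code <- as.numeric(gender)

# Collect terms in y = b0 + b1 * x1 + b2 * code + bint * x1 * code + e
implied <- data.frame(
  gender = gender,
  intercept = b0 + b2 * code,
  slope = b1 + bint * code
)
implied
```

Under this coding, the expected line for the Male group has intercept 9 and slope 13, and the line for the Female group has intercept 6 and slope 9, so any plot that fits a separate line per gender should show two different slopes.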

Using ggplot2

Because ggplot2 is built on the grammar of graphics, the code for almost all plots will follow the same format.

ggplot(data = <data>, mapping = aes(<mappings>)) +
  geom_<element>()

In this structure, data defines the data for the plot, mapping defines how aesthetics are mapped to variables, and the geom functions add elements to the plot. For example, using our simulated data, we can map the predictor to the x-axis, the outcome to the y-axis, and make a scatterplot.

ggplot(data = mlm_data, mapping = aes(x = predictor, y = outcome)) +
  geom_point()

We could also make a bar plot to show the number of respondents from each group.

ggplot(data = mlm_data, mapping = aes(x = race)) +
  geom_bar()

Or we could make a histogram to look at the distribution of our outcome variable.

ggplot(data = mlm_data, mapping = aes(x = outcome)) +
  geom_histogram()
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Notice that for the bar plot and histogram we did not define an aesthetic for the y-axis. By default, ggplot2 calculates the count for each value (or, for the histogram, each bin) on the x-axis. For each geom, the help pages list which aesthetics are required and which others can be specified if desired (e.g., ?geom_histogram).
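
To make the default counting concrete, here is a minimal sketch on a made-up data frame (toy, with hypothetical columns group and value, not the simulated data from this post). It compares geom_bar() with counting by hand, and shows how supplying binwidth replaces the default of 30 bins mentioned in the message above.

```r
library(ggplot2)

# Hypothetical toy data, not the simulated data used in this post
set.seed(1)
toy <- data.frame(
  group = sample(c("a", "b", "c"), size = 50, replace = TRUE),
  value = rnorm(50)
)

# geom_bar() counts the rows in each group for us ...
p_bar <- ggplot(data = toy, mapping = aes(x = group)) +
  geom_bar()

# ... which matches counting by hand and drawing the result with geom_col()
counts <- as.data.frame(table(group = toy$group))
p_col <- ggplot(data = counts, mapping = aes(x = group, y = Freq)) +
  geom_col()

# The counts ggplot2 computed can be inspected with layer_data()
layer_data(p_bar)$count

# Supplying binwidth (or bins) replaces the default of 30 bins
p_hist <- ggplot(data = toy, mapping = aes(x = value)) +
  geom_histogram(binwidth = 0.5)
```

The value 0.5 is arbitrary; a sensible bin width depends on the scale of the variable being plotted.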

Altering the default plot

Let’s go back to our scatterplot to look at how we can change the details to look more like what we want.

ggplot(data = mlm_data, mapping = aes(x = predictor, y = outcome)) +
  geom_point()

We can change aspects of the geom itself by adding arguments to the geom call.

ggplot(data = mlm_data, mapping = aes(x = predictor, y = outcome)) +
  geom_point(color = "blue", size = 3, alpha = 0.3, shape = 15)

Here, we’ve made the dots square, bigger, blue, and slightly transparent. A full list of shapes is available here.

It is also possible to map these aesthetics to variables in the dataset, just like we did with the axes.

ggplot(data = mlm_data, mapping = aes(x = predictor, y = outcome,
  color = gender)) +
  geom_point()

Now each gender has its own color, and a legend is automatically generated. We can also mix aesthetics that are and are not mapped to variables.

ggplot(data = mlm_data, mapping = aes(x = predictor, y = outcome,
  color = gender)) +
  geom_point(shape = 15, alpha = 0.6, size = 3)

Here, color is still assigned to gender, but the shape and alpha aesthetics are applied to the entire geom.
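
A common stumbling block is putting a fixed value inside aes(). The sketch below, on a hypothetical five-row data frame named toy, contrasts setting a color with accidentally mapping it. It is meant as an illustration of the distinction, not as code from the original post.

```r
library(ggplot2)

# Hypothetical toy data
toy <- data.frame(x = 1:5, y = c(2, 4, 3, 5, 1))

# Set: color is a fixed value, given outside aes()
p_set <- ggplot(data = toy, mapping = aes(x = x, y = y)) +
  geom_point(color = "blue")

# Mapped by mistake: "blue" is treated as a one-level variable, so the points
# get the first color of the default palette and a legend appears
p_mapped <- ggplot(data = toy, mapping = aes(x = x, y = y)) +
  geom_point(mapping = aes(color = "blue"))

# Colors actually used by each plot
unique(layer_data(p_set)$colour)
unique(layer_data(p_mapped)$colour)
```

If a plot shows an unexpected color together with a one-entry legend, a constant inside aes() is a likely cause.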

Layering geoms

Often, we want to add additional elements to our plots. This is straightforward in ggplot2: we simply add another geom.

ggplot(data = mlm_data, aes(x = predictor, y = outcome)) +
  geom_point() +
  geom_smooth(method = "lm")

By default, geom_smooth uses method = "gam" for samples of 1,000 or more observations (and "loess" for smaller samples), but we can request a linear model with method = "lm". Just like before, we can also map aesthetics to other variables.

ggplot(data = mlm_data, mapping = aes(x = predictor, y = outcome,
  color = gender)) +
  geom_point() +
  geom_smooth(method = "lm")
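
To check what method = "lm" actually draws, the following sketch compares the smooth layer with a model fit directly by lm(). The data frame toy and its coefficients are invented for the example, and formula = y ~ x is spelled out only to make the default explicit.

```r
library(ggplot2)

# Hypothetical toy data with a known linear trend
set.seed(2)
toy <- data.frame(x = rnorm(100))
toy$y <- 1 + 2 * toy$x + rnorm(100)

p <- ggplot(data = toy, mapping = aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# The same model fit directly
fit <- lm(y ~ x, data = toy)

# Values ggplot2 computed for the smooth layer (layer 2)
smooth <- layer_data(p, 2)

# Compare the plotted line with predictions from the direct fit
all.equal(smooth$y, unname(predict(fit, newdata = data.frame(x = smooth$x))))
```

The plotted line should agree with the direct fit, since geom_smooth is fitting the same simple regression behind the scenes.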

It’s also possible to map aesthetics to additional variables for only specific geoms.

ggplot(data = mlm_data, mapping = aes(x = predictor, y = outcome)) +
  geom_point() +
  geom_smooth(mapping = aes(color = gender), method = "lm")

Notice that we’ve moved the color mapping to the geom_smooth call. This results in a different smoothed line for each group, but this is not extended to the points. Aesthetics that are defined in the top ggplot call are global and get applied to all geoms, whereas aesthetics defined within the geom are local and apply only to that specific geom.
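
The global versus local distinction can be verified by looking at the data each layer ends up with. This is a minimal sketch on a hypothetical data frame toy with a two-level grouping column g; layer_data() is used here only as a convenient way to inspect the result.

```r
library(ggplot2)

# Hypothetical toy data with two groups
set.seed(3)
toy <- data.frame(
  x = rnorm(60),
  g = rep(c("a", "b"), each = 30)
)
toy$y <- toy$x + rnorm(60)

p <- ggplot(data = toy, mapping = aes(x = x, y = y)) +  # global: x and y
  geom_point() +                                        # inherits x and y only
  geom_smooth(mapping = aes(color = g), method = "lm",  # local: color
    formula = y ~ x)

# Points form a single group; the smooth layer is split by g
unique(layer_data(p, 1)$group)
unique(layer_data(p, 2)$group)
```

Only the smooth layer is split into two groups, because the color mapping was never passed to geom_point.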

This can also be applied to data. For example, we could plot points from only the Hispanic group, but use the full data set to fit the smooth lines.

ggplot(data = mlm_data, mapping = aes(x = predictor, y = outcome)) +
  geom_point(data = filter(mlm_data, race == "Hispanic")) +
  geom_smooth(mapping = aes(color = gender), method = "lm")

Splitting apart groups

Sometimes it can be beneficial to look at groups separately, rather than together in a single plot. This can be accomplished with faceting.

ggplot(data = mlm_data, mapping = aes(x = predictor, y = outcome)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~ gender)

It may also be helpful to plot the full data within each facet and just highlight the specific group. This can be accomplished by using two calls to geom_point and removing the faceting variable from the data in the first. Because that layer’s data no longer contains gender, its points are drawn in every facet.

ggplot(data = mlm_data, mapping = aes(x = predictor, y = outcome)) +
  geom_point(data = select(mlm_data, -gender), alpha = 0.5) +
  geom_point(mapping = aes(color = gender), alpha = 0.5) +
  geom_smooth(method = "lm") +
  facet_wrap(~ gender)
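
The effect of dropping the faceting variable can be seen by counting the rows each layer draws per panel. The sketch below uses a hypothetical six-row data frame toy and base R subsetting in place of select(), purely to keep the example free of extra dependencies.

```r
library(ggplot2)

# Hypothetical toy data: three rows in each of two groups
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2, 1, 4, 3, 6, 5),
  g = rep(c("a", "b"), each = 3)
)

p <- ggplot(data = toy, mapping = aes(x = x, y = y)) +
  # Background layer: data without g (same idea as select(toy, -g))
  geom_point(data = toy[c("x", "y")], color = "grey70") +
  # Foreground layer: full data, so each point lands in its own panel
  geom_point(mapping = aes(color = g)) +
  facet_wrap(~ g)

# Rows drawn per panel by each layer
table(layer_data(p, 1)$PANEL)
table(layer_data(p, 2)$PANEL)
```

The background layer is repeated in full in both panels, while the foreground layer contributes only the rows belonging to each panel's group.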

Making it look pretty

So far we’ve looked at how we can use geoms and aesthetics to create the elements of a plot. But ggplot2 also provides methods for formatting the plots to look exactly how you want. For example, we can add titles and change scales.

ggplot(data = mlm_data, mapping = aes(x = predictor, y = outcome)) +
  geom_point(data = select(mlm_data, -gender), alpha = 0.5) +
  geom_point(mapping = aes(color = gender), alpha = 0.5) +
  geom_smooth(method = "lm") +
  facet_wrap(~ gender) +
  labs(
    x = "An important predictor",
    y = "Representative outcome",
    title = "An important finding",
    subtitle = "More details about this very important thing"
  ) +
  scale_x_continuous(breaks = seq(-5, 5, 1)) +
  scale_y_continuous(breaks = seq(-100, 100, 10))

We can also define the colors that get used.

ggplot(data = mlm_data, mapping = aes(x = predictor, y = outcome)) +
  geom_point(data = select(mlm_data, -gender), alpha = 0.5) +
  geom_point(mapping = aes(color = gender), alpha = 0.5) +
  geom_smooth(method = "lm") +
  facet_wrap(~ gender) +
  labs(
    x = "An important predictor",
    y = "Representative outcome",
    title = "An important finding",
    subtitle = "More details about this very important thing"
  ) +
  scale_x_continuous(breaks = seq(-5, 5, 1)) +
  scale_y_continuous(breaks = seq(-100, 100, 10)) +
  scale_color_manual(values = c("red", "blue"))

Almost anything about a plot’s appearance can be altered with scales or themes.

ggplot(data = mlm_data, mapping = aes(x = predictor, y = outcome)) +
  geom_point(data = select(mlm_data, -gender), alpha = 0.5) +
  geom_point(mapping = aes(color = gender), alpha = 0.5) +
  geom_smooth(method = "lm") +
  facet_wrap(~ gender) +
  labs(
    x = "An important predictor",
    y = "Representative outcome",
    title = "An important finding",
    subtitle = "More details about this very important thing"
  ) +
  scale_x_continuous(breaks = seq(-5, 5, 1)) +
  scale_y_continuous(breaks = seq(-100, 100, 10)) +
  scale_color_manual(values = c("red", "blue")) +
  theme_bw() +
  theme(
    legend.position = "bottom",
    panel.grid.minor.x = element_blank(),
    plot.title = element_text(face = "bold"),
    plot.subtitle = element_text(face = "italic"),
    axis.title = element_text(size = 8)
  )

To format legends, we can use the guides function.

ggplot(data = mlm_data, mapping = aes(x = predictor, y = outcome)) +
  geom_point(data = select(mlm_data, -gender), alpha = 0.5) +
  geom_point(mapping = aes(color = gender), alpha = 0.5) +
  geom_smooth(method = "lm", color = "gold") +
  facet_wrap(~ gender) +
  labs(
    x = "An important predictor",
    y = "Representative outcome",
    title = "An important finding",
    subtitle = "More details about this very important thing"
  ) +
  scale_x_continuous(breaks = seq(-5, 5, 1)) +
  scale_y_continuous(breaks = seq(-100, 100, 10)) +
  scale_color_manual(values = c("red", "blue")) +
  theme_bw() +
  theme(
    legend.position = "bottom",
    panel.grid.minor.x = element_blank(),
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(face = "italic", size = 12),
    axis.title = element_text(size = 10),
    legend.title = element_text(size = 10),
    legend.text = element_text(size = 8)
  ) +
  guides(
    color = guide_legend(title = "Gender", title.position = "top",
      title.hjust = 0.5, label.position = "bottom", label.hjust = 0.5,
      keywidth = unit(1, "cm"), override.aes = list(alpha = 1, size = 3))
  )

As can be seen from this last plot, the downside to ggplot2 is that the code to create a plot can become quite verbose. However, this is because we are able to alter almost any aspect of the plot.
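
One way to keep the verbosity manageable is to store the pieces that repeat across plots and add them in one step, since ggplot2 accepts a list of components on the right-hand side of +. The sketch below is one possible approach, not something from the original post; the names my_style and toy are hypothetical.

```r
library(ggplot2)

# Hypothetical helper: formatting shared by several plots
my_style <- list(
  theme_bw(),
  theme(
    legend.position = "bottom",
    plot.title = element_text(face = "bold")
  ),
  labs(title = "An important finding")
)

# Hypothetical toy data
toy <- data.frame(
  x = 1:5,
  y = c(2, 4, 3, 5, 1),
  g = c("a", "b", "a", "b", "a")
)

# The same list can be added to any number of plots
p_points <- ggplot(data = toy, mapping = aes(x = x, y = y, color = g)) +
  geom_point() +
  my_style

p_bars <- ggplot(data = toy, mapping = aes(x = g)) +
  geom_bar() +
  my_style
```

Order matters inside the list: theme_bw() comes before theme() so that the specific tweaks are not overwritten by the complete theme.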

Endless possibilities

So far, we’ve only talked in detail about a few commands that are useful for creating plots typical of a regression. However, there are many more geoms, scales, and theme options for creating almost any type of graphic you can think of.

For example, we can plot how students adapt through different levels of an adaptive assessment.

Or we can look at the probability of a respondent providing the correct response to an item, given their ability, in different types of psychometric models.

We can also use heat maps to compare the amount of error present for combinations of variables.

Alternatively, we could do more fun things like look at which US cities have the most breweries.

Or look at the distribution of brewery ratings for the surrounding states.

Finally, there are many extensions to ggplot2 (like the gganimate package from David Robinson), which we can use to plot the probability of KU winning a basketball game over time.

I really don’t think it’s an exaggeration to say the possibilities are endless!

Session Information

devtools::session_info()
#> Session info --------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.3.2 (2016-10-31)
#>  system   x86_64, darwin13.4.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  tz       America/Chicago             
#>  date     2017-02-01
#> Packages ------------------------------------------------------------------
#>  package      * version    date       source                           
#>  animation    * 2.4        2015-08-16 cran (@2.4)                      
#>  assertthat     0.1        2013-12-06 CRAN (R 3.3.0)                   
#>  backports      1.0.4      2016-10-24 cran (@1.0.4)                    
#>  codetools      0.2-15     2016-10-05 CRAN (R 3.3.2)                   
#>  colorspace     1.2-6      2015-03-11 CRAN (R 3.3.0)                   
#>  DBI            0.5-1      2016-09-10 cran (@0.5-1)                    
#>  devtools       1.12.0     2016-06-24 CRAN (R 3.3.0)                   
#>  digest         0.6.11     2017-01-03 cran (@0.6.11)                   
#>  dplyr        * 0.5.0      2016-06-24 CRAN (R 3.3.0)                   
#>  evaluate       0.10       2016-10-11 CRAN (R 3.3.0)                   
#>  gganimate    * 0.1        2016-09-22 Github (dgrtwo/gganimate@26ec501)
#>  ggplot2      * 2.2.1.9000 2017-01-26 Github (hadley/ggplot2@2a1bf98)  
#>  gtable         0.2.0      2016-02-26 CRAN (R 3.3.0)                   
#>  htmltools      0.3.5      2016-03-21 cran (@0.3.5)                    
#>  knitr        * 1.15.1     2016-11-22 cran (@1.15.1)                   
#>  labeling       0.3        2014-08-23 CRAN (R 3.3.0)                   
#>  lazyeval       0.2.0.9000 2016-09-19 Github (hadley/lazyeval@c155c3d) 
#>  magrittr       1.5        2014-11-22 CRAN (R 3.3.0)                   
#>  maps         * 3.1.1      2016-07-27 CRAN (R 3.3.0)                   
#>  memoise        1.0.0      2016-01-29 CRAN (R 3.3.0)                   
#>  munsell        0.4.3      2016-02-13 CRAN (R 3.3.0)                   
#>  plyr           1.8.4.9000 2016-11-03 Github (hadley/plyr@fe19241)     
#>  purrr        * 0.2.2.9000 2016-11-22 Github (hadley/purrr@5360143)    
#>  R6             2.2.0      2016-10-05 cran (@2.2.0)                    
#>  RColorBrewer * 1.1-2      2014-12-07 CRAN (R 3.3.0)                   
#>  Rcpp           0.12.9.1   2017-01-24 Github (RcppCore/Rcpp@5a99a86)   
#>  readr        * 1.0.0.9000 2016-09-17 Github (hadley/readr@37d6eda)    
#>  rmarkdown      1.3        2016-12-21 cran (@1.3)                      
#>  rprojroot      1.1        2016-10-29 cran (@1.1)                      
#>  scales         0.4.1      2016-11-09 CRAN (R 3.3.2)                   
#>  stringi        1.1.2      2016-10-01 CRAN (R 3.3.1)                   
#>  stringr        1.1.0      2016-08-19 cran (@1.1.0)                    
#>  tibble         1.2-15     2017-01-11 Github (hadley/tibble@3d6f8b4)   
#>  tidyr        * 0.6.1.9000 2017-01-24 Github (hadley/tidyr@0f9a5da)    
#>  withr          1.0.2      2016-06-20 CRAN (R 3.3.0)                   
#>  yaml           2.1.14     2016-11-12 cran (@2.1.14)